Reactions: MIT MAS741 Context-Aware papers

Greg Detre

Sunday, September 15, 2002

 

Dawson, Experimental design

The Dawson paper serves two functions. It spells out what to avoid in experimental design in general, and it allows us to choose the best possible experimental design given the constraints under which we are operating. In the latter case, we could easily construct a flowchart to help navigate from ideal experimental designs, which require significant control over subjects, implementation, testing and timing as well as resources and a large sample group, through to less ideal experimental scenarios involving fewer subjects, limitations on the sort of control group we can use, and so on.

LAFCam

I liked the idea of the LAFCam. It seems likely that it would do a reasonable job of highlighting the most important sequences of a home video, and importantly, it doesn't require the user to make any extra effort or to be any smarter; indeed, if anything, it encourages them to be more engaged with the scene and less with the process.

The target domain of home video editing is well-chosen. Unfortunately, though, it seems likely that making it marketable in a wider professional domain would be very difficult. If we consider the professional editing of a television programme as an example, there are various hurdles that would have to be overcome.

The first is that the cameraman and the director/editor are different people. There may well be more than one cameraman, and they will almost certainly be required to play different roles and to take different shots and perspectives on the scene. However, they will all be viewing the same scene, which for most programmes will necessarily involve lengthy periods with little emotional arousal and only occasional moments of excitement. Each cameraman will experience emotional arousal at the same moments, so arousal alone cannot distinguish between cameras; the system would instead need to target the director's emotional response in some way. Perhaps the system could employ eye tracking so that the director's emotional response to a given camera could highlight certain shots/perspectives and speed up the editing process later.

Secondly, the prototype system relies on only two main measures of emotional arousal. In one sense, this is a minor point. But at the same time, adding sensors would bring the inherent problem of tying together multimodal input, which would probably require a much more complicated learning algorithm and general training approach.
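To make the multimodal point concrete, here is a minimal sketch of the naive approach of concatenating two modalities into a single feature vector for one classifier. This is not the LAFCam's actual pipeline; the feature dimensions, the synthetic data and the choice of classifier are all invented for illustration.

# Minimal sketch: naive fusion of two sensor modalities into one feature vector.
# Not the LAFCam's pipeline -- all features and labels here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_segments = 200

gsr_features = rng.normal(size=(n_segments, 4))     # e.g. mean level, slope, peak count, variance
audio_features = rng.normal(size=(n_segments, 12))  # e.g. cepstral coefficients per segment
labels = rng.integers(0, 2, size=n_segments)        # 1 = "interesting" segment, 0 = not

# Naive fusion: concatenate the modalities and train a single classifier.
fused = np.hstack([gsr_features, audio_features])
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print("training accuracy:", clf.score(fused, labels))

Every added sensor widens this vector, and in practice each modality may need its own preprocessing, temporal alignment and far more labelled training data, which is the complication referred to above.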

Thirdly, I think we have to realise that there is a very solid ceiling which we cannot expect any number of sensors to push us above. If we imagine there were a homunculus inside the LAFCam, then I think that the input from the GSR, microphone and second camera would probably provide quite enough information to make pretty accurate choices about which shots to include, even without being able to see the scene being recorded. However, it seems to me that only a system that could pretty much pass the Turing test would be able to know whether a given burst of laughter signalled a climactic moment on the basis of the last 3 seconds or 3 minutes of footage. In a similar way, I think it would be difficult for any foreseeable system to tell whether the cameraman's speech is a sign that he is fiddling with the pause button, lining people up for the shot or delightedly congratulating himself. Even an augmented LAFCam might never exceed 60% or 70% accuracy, no matter how many sensors it had. Getting those last few percent right is what context-awareness is all about, and I think it would require massively more context than is being considered here.

 

It seems almost as though there is a kind of "critical mass" of context-awareness, above which a system is able to make independent, valuable assessments incorporating both explicit and implicit information which could seriously aid the user. We might equally imagine that below this hypothetical critical mass, the system is simply unable to take into account the enormous variety of information that we regularly base mundane decisions upon, and focuses too rigidly upon a small number of explicitly-incorporated contextual clues. Obviously, the critical mass will vary for different problems, based on:

1)      the number and complexity of factors that a person would normally take into account

2)      the difficulty:

i)        for the designer to encode enough of them explicitly in a machine-comprehensible form

ii)      or, for the machine to learn the decision-making process for itself somehow.

The reason that this notion of a critical mass of context awareness may be important when considering how to make a system more context-aware is that if the system falls below this critical mass, it becomes at best defunct or, more likely, highly obstructive. If we consider Microsoft Word features like the paperclip assistant, or early spell-checkers, we can see that they simply did not take into account anywhere near enough of the context to be helpful, and indeed could be positively damaging if we allowed them to make decisions for us that turned out to be wrong. Below this critical mass, a halfway context-aware system can be more obstructive than a system which simply leaves the decision up to the user.

The LAFCam seems like a pretty successful prototype implementation of an interesting idea. However, if we take it as an example, we can see how it could be described as falling below the critical mass of useful context awareness in every domain except the narrow home-video market. We can see why this is so in terms of the main issues relating to the critical mass of a problem:

1)      The LAFCam effectively uses emotional arousal as a short-cut or rule of thumb with which to evaluate different sequences, but it ignores the underlying rich emotional and aesthetic context that gives one video sequence value over another. Wherever overall emotional arousal ceases to be a reliable guide, the LAFCam becomes either defunct or obstructive. It might often pick incorrect sequences (false positives or, worse still, false negatives), requiring the user to watch all of the footage anyway just to be sure.

2)      The difficulty:

i)        We are in no position to try to encode explicitly the factors that people take into account when choosing which shots to include in a given video sequence. Outside the narrow band of home videos, getting a system to reliably take into account emotional or aesthetic context becomes a restatement of the full problem of AI.

ii)      The LAFCam speech and laughter recognition is based on a standard statistical learning algorithm, which has proven useful in a number of different areas. However, the LAFCam is not learning, for instance, to associate certain sounds (and their corresponding GSRs) with some rating system for how good a given shot is; the researchers have already made that choice for it, in the pre-categorised range of speech/laughter training samples to which it is exposed. Some enhancement along those lines would be the minimum necessary for us to say that it is learning to take contextual clues into account for itself (a rough sketch of what that might look like follows this list).
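As a hedged illustration of the enhancement described in ii), the sketch below learns a mapping from per-shot audio and GSR features to the user's own ratings of each shot, so that the supervisory signal comes from the user rather than from researcher-defined speech/laughter categories. The feature layout, the data and the choice of regressor are assumptions for illustration, not part of the LAFCam as published.

# Hypothetical enhancement: learn to predict the user's own shot ratings
# from combined audio + GSR features, rather than classifying pre-labelled sounds.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n_shots = 150

audio_features = rng.normal(size=(n_shots, 12))
gsr_features = rng.normal(size=(n_shots, 4))
shot_features = np.hstack([audio_features, gsr_features])

# Ratings the user assigned after reviewing the footage (0 = discard, 1 = definitely keep).
user_ratings = rng.uniform(0.0, 1.0, size=n_shots)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(shot_features, user_ratings)

# Score a new, unrated shot.
new_shot = rng.normal(size=(1, 16))
print("predicted rating:", model.predict(new_shot)[0])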

 

My concern is that we currently underestimate how high the critical mass bar for useful context-aware systems is in most domains, and overestimate how generalisable the context that we supply explicitly for narrow domains will be. I can only see two main approaches to this problem in the medium term:

1.       Extensive experimentation to discover what rules of thumb are available for designers to incorporate into their systems. In the case of the LAFCam, we could build a prototype with an enormous array of sensors in the lab, and then see afterwards which sensors' readings correlated best with a subjective evaluation of the video footage, and which sensors complemented each other (see the sketch after this list). These would be the ones incorporated in the final, marketed product. In this way, we might find extremely valuable shortcut indicators of wider context, but these would probably only be valid within restricted domains similar to our experiment.

2.       In a related vein, we have a growing arsenal of self-organising and flexible learning algorithms, mostly based on statistical, evolutionary or connectionist principles, which I think would enable the system to fine-tune a weighting over all of the different inputs and factors it has at its disposal (a rough sketch follows the next paragraph). This also has the virtue that the system could adapt itself according to a user model, which would vary from person to person, and a situation model, depending on the particular task/domain in which it is used.
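A minimal sketch of the screening experiment from point 1: record many candidate sensors in the lab, then rank them by how well their readings correlate with a subjective rating of the footage. The sensor names and all of the data are invented for illustration.

# Sketch of the sensor-screening experiment from point 1.
# Sensor names and readings are invented; ratings would come from a human reviewer.
import numpy as np

rng = np.random.default_rng(2)
n_segments = 300

sensors = ["gsr", "heart_rate", "audio_energy", "head_motion", "eye_blink_rate"]
readings = {name: rng.normal(size=n_segments) for name in sensors}
ratings = rng.normal(size=n_segments)   # subjective quality rating per segment

# Rank sensors by absolute correlation between their readings and the ratings.
correlations = {name: abs(np.corrcoef(values, ratings)[0, 1])
                for name, values in readings.items()}
for name, corr in sorted(correlations.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: |r| = {corr:.2f}")

Correlations between the sensors themselves could then be used to drop sensors that add little beyond one already chosen.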

If we want to build systems that aren't context-brittle, we need to be taking into account a genuinely enormous array of variables, which we cannot possibly fine-tune by hand. I would argue, then, that context-aware designers have to rely either on finding rules of thumb through experimentation which divide up this massive contextual problem space, or on building systems that are good at navigating the fiendish, multi-dimensional landscape of this problem space for themselves.
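And a rough sketch of the second approach, the per-user weighting from point 2: the system nudges a weight for each input towards the user's own keep/discard decisions as they are made. The update rule here is a plain online logistic-regression step, standing in for whatever statistical, evolutionary or connectionist method one might actually use; the inputs and the simulated user are invented.

# Sketch of point 2: an online, per-user weighting over the system's inputs,
# tuned from the user's keep/discard decisions. All data here is simulated.
import numpy as np

rng = np.random.default_rng(3)
n_inputs = 8                    # one value per sensor/factor for a segment
weights = np.zeros(n_inputs)    # this user's personal weighting
learning_rate = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update(weights, features, kept):
    """One online step: nudge the weights towards the user's decision."""
    prediction = sigmoid(weights @ features)
    return weights + learning_rate * (kept - prediction) * features

# Simulated session: the user reviews segments and keeps or discards each one.
for _ in range(500):
    features = rng.normal(size=n_inputs)
    kept = float(features[0] + 0.5 * features[1] > 0)   # stand-in for real user behaviour
    weights = update(weights, features, kept)

print("learned per-input weights:", np.round(weights, 2))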

More

Moreover, a system that has more than the critical mass is able to learn to become even better.

He pointed out that most of my worries are about expectations being broken.